{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f0e29fef",
   "metadata": {},
   "source": [
    "# End-to-End Example Using SQL Databases\n",
    "\n",
    "SuperDuperDB offers the flexibility to connect to various SQL databases, including:\n",
    "\n",
    "- MongoDB\n",
    "- PostgreSQL\n",
    "- SQLite\n",
    "- DuckDB\n",
    "- BigQuery\n",
    "- ClickHouse\n",
    "- DataFusion\n",
    "- Druid\n",
    "- Impala\n",
    "- MSSQL\n",
    "- MySQL\n",
    "- Oracle\n",
    "- Snowflake\n",
    "\n",
    "In this example, we showcase how to implement multimodal vector-search with DuckDB. This is an extension of multimodal vector-search with MongoDB, which is just slightly easier to set up (see [here](https://docs.superduperdb.com/docs/use_cases/items/multimodal_image_search_clip)). Everything demonstrated here applies equally to any of the supported SQL databases mentioned above, as well as to tabular data formats on disk, such as `pandas`.\n",
    "\n",
    "Real life use cases could be vectorizing diverse things like images, texts and searching it efficiently with SuperDuperDB."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff1db9c6",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "Before proceeding with this use-case, ensure that you have installed the necessary software requirements:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38d752ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install superduperdb\n",
    "!pip install torch torchvision openai-clip"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "You also need to clean up the testing environment, just to ensure idempotency across test runs."
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "1a1dbd95851caa63"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "!rm -rf .superduperdb/ && mkdir -p .superduperdb"
   ],
   "metadata": {
    "collapsed": false
   },
   "id": "ddfab34d502f902b"
  },
  {
   "cell_type": "markdown",
   "id": "dfde8264",
   "metadata": {},
   "source": [
    "## Connect to Datastore\n",
    "\n",
    "The initial step in any `superduperdb` workflow is to connect to your datastore. To connect to a different datastore, simply uncomment the respective section. We demonstrate three different types of connections:\n",
    "1. Embedded Databases: Run directly on this notebook.\n",
    "2. Cloud Databases: Services are running on the cloud.\n",
    "3. Containerized Databases: Services must be manually provisioned using `make testdb_init`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8e7ef91-9eda-4fbd-b34f-b49b5411fc47",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import superduper\n",
    "\n",
    "# ---------------------------------\n",
    "# DuckDB (Embedded)\n",
    "# ---------------------------------\n",
    "!pip install \"ibis-framework[duckdb]\"\n",
    "connection_uri = \"duckdb://.superduperdb/test.duckdb\"\n",
    "\n",
    "# ---------------------------------\n",
    "# SQLite (Embedded)\n",
    "# ---------------------------------\n",
    "# !pip install \"ibis-framework[sqlite]\"\n",
    "# connection_uri = \"sqlite://.superduperdb/test.sqlite\"\n",
    "\n",
    "# ---------------------------------\n",
    "# ClickHouse (Cloud)\n",
    "# ---------------------------------\n",
    "# !pip install \"ibis-framework[clickhouse]\"\n",
    "# user = 'play'\n",
    "# password = ''\n",
    "# host = 'play.clickhouse.com'\n",
    "# port = 443\n",
    "# connection_uri = f\"clickhouse://{user}:{password}@{host}:{port}\"\n",
    "\n",
    "# ---------------------------------\n",
    "# SnowFlake (Cloud)\n",
    "# ---------------------------------\n",
    "# !pip install \"ibis-framework[snowflake]\"\n",
    "# user = \"<user name>\"\n",
    "# password = \"<user password>\"\n",
    "# account = \"<account identifier>\"  \n",
    "# database = \"<database schema>\" \n",
    "# connection_uri = f\"snowflake://{user}:{password}@{account}/{database}\"\n",
    "\n",
    "# ---------------------------------\n",
    "# BigQuery (Cloud)\n",
    "# ---------------------------------\n",
    "# !pip install 'ibis-framework[bigquery]'\n",
    "# project_id=\"ibis-bq-project\"\n",
    "# dataset_id=\"testing\"\n",
    "# connection_uri = f\"bigquery://{project_id}/{dataset_id}\"\n",
    "\n",
    "# ---------------------------------\n",
    "# PostgreSQL (Containerized)\n",
    "# ---------------------------------\n",
    "# !pip install \"ibis-framework[postgres]\" psycopg2-binary\n",
    "# connection_uri = \"postgres://superduper:superduper@localhost:5432/test_db\"\n",
    "\n",
    "# ---------------------------------\n",
    "# MySQL (Containerized)\n",
    "# ---------------------------------\n",
    "# !pip install \"ibis-framework[mysql]\"\n",
    "# connection_uri = \"mysql://superduper:superduper@localhost:3306/test_db\"\n",
    "\n",
    "# ---------------------------------\n",
    "# MS SQL Server (Containerized)\n",
    "# ---------------------------------\n",
    "# !pip install \"ibis-framework[mssql]\"\n",
    "# connection_uri = \"mssql://sa:Superduper#1@localhost:1433/msdb\"\n",
    "\n",
    "    \n",
    "# ---------------------------------\n",
    "\n",
    "# Let's super duper your SQL database\n",
    "db = superduper(connection_uri,\n",
    "                metadata_store=\"sqlite:///.superduperdb/metadata.sqlite\",\n",
    "                )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a5252a3-91a7-4789-85b7-0b8279c228a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "db.show('vector_index')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8794451",
   "metadata": {},
   "source": [
    "## Load Dataset\n",
    "\n",
    "Now that you're connected, add some data to the datastore:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9d2b073-38b0-4d29-aa65-e568f19e7852",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download the coco_sample.zip file\n",
    "![[ -e coco_sample.zip ]] || curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/coco_sample.zip\n",
    "\n",
    "# Download the captions_tiny.json file\n",
    "![[ -e captions_tiny.json ]] || curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/captions_tiny.json\n",
    "\n",
    "# Clean up the testing data directory\n",
    "!rm -rf ./data\n",
    "\n",
    "# Create a directory named 'data/coco'\n",
    "!mkdir -p data/coco\n",
    "\n",
    "# Unzip the contents of coco_sample.zip\n",
    "!unzip coco_sample.zip\n",
    "\n",
    "# Move the 'images_small' directory to 'data/coco/images'\n",
    "!mv images_tiny data/coco/images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec5d36d3-7e74-4c87-92c2-ed1586330858",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "import json\n",
    "import pandas as pd\n",
    "from PIL import Image\n",
    "\n",
    "# Open the 'captions_tiny.json' file and load its contents\n",
    "with open('captions_tiny.json') as f:\n",
    "    data = json.load(f)[:500]\n",
    "\n",
    "# Create a DataFrame from a list comprehension with image paths and captions\n",
    "data = pd.DataFrame([\n",
    "    {\n",
    "        'image': r['image']['_content']['path'],\n",
    "        'captions': r['captions']\n",
    "    } for r in data\n",
    "])\n",
    "\n",
    "# Add an 'id' column to the DataFrame\n",
    "data['id'] = pd.Series(data.index).apply(str)\n",
    "\n",
    "# Create a DataFrame with 'id' and 'image' columns\n",
    "images_df = data[['id', 'image']]\n",
    "\n",
    "# Open each image using PIL.Image\n",
    "images_df['image'] = images_df['image'].apply(Image.open)\n",
    "\n",
    "# Create a DataFrame with 'id' and 'captions' columns, exploding the 'captions' column\n",
    "captions_df = data[['id', 'captions']].explode('captions')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b43cd7d2",
   "metadata": {},
   "source": [
    "## Define Schema\n",
    "\n",
    "For this use-case, you need a table with images and another table with text. SuperDuperDB extends standard SQL functionality, allowing developers to define their own data types through the `Encoder` abstraction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d9483f3-78c5-47df-9fa2-4cd070282791",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb.backends.ibis.query import Table\n",
    "from superduperdb.backends.ibis.field_types import dtype\n",
    "from superduperdb.ext.pillow import pil_image\n",
    "from superduperdb import Schema\n",
    "\n",
    "# Define the 'captions' table\n",
    "captions = Table(\n",
    "    'captions',\n",
    "    primary_id='id',\n",
    "    schema=Schema(\n",
    "        'captions-schema',\n",
    "        fields={'id': dtype(str), 'captions': dtype(str)},\n",
    "    )\n",
    ")\n",
    "\n",
    "# Define the 'images' table\n",
    "images = Table(\n",
    "    'images',\n",
    "    primary_id='id',\n",
    "    schema=Schema(\n",
    "        'images-schema',\n",
    "        fields={'id': dtype(str), 'image': pil_image},\n",
    "    )\n",
    ")\n",
    "\n",
    "# Add the 'captions' and 'images' tables to the SuperDuperDB database\n",
    "db.add(captions)\n",
    "db.add(images)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "115b2c14",
   "metadata": {},
   "source": [
    "## Add data to the datastore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93cce29b-dd04-47d2-bdfc-fe3780e06ddb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Insert data from the 'images_df' DataFrame into the 'images' table\n",
    "_ = db.execute(images.insert(images_df))\n",
    "\n",
    "# Insert data from the 'captions_df' DataFrame into the 'captions' table\n",
    "_ = db.execute(captions.insert(captions_df))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "def10282",
   "metadata": {},
   "source": [
    "## Build SuperDuperDB `Model` Instances\n",
    "\n",
    "This use-case utilizes the `superduperdb.ext.torch` extension. Both models use `torch` tensors in their output, which are encoded with `tensor`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d87ed43d-6f90-46c1-8851-6050ae21a051",
   "metadata": {},
   "outputs": [],
   "source": [
    "import clip\n",
    "import torch\n",
    "from superduperdb.ext.torch import TorchModel, tensor\n",
    "\n",
    "# Load the CLIP model\n",
    "model, preprocess = clip.load(\"RN50\", device='cpu')\n",
    "\n",
    "# Define a tensor type\n",
    "t = tensor(torch.float, shape=(1024,))\n",
    "\n",
    "# Create a TorchModel for text encoding\n",
    "text_model = TorchModel(\n",
    "    identifier='clip_text',\n",
    "    object=model,\n",
    "    preprocess=lambda x: clip.tokenize(x)[0],\n",
    "    encoder=t,\n",
    "    forward_method='encode_text',    \n",
    ")\n",
    "\n",
    "# Create a TorchModel for visual encoding\n",
    "visual_model = TorchModel(\n",
    "    identifier='clip_image',\n",
    "    object=model.visual,    \n",
    "    preprocess=preprocess,\n",
    "    encoder=t,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89c5c236",
   "metadata": {},
   "source": [
    "## Create a Vector-Search Index\n",
    "\n",
    "Define a multimodal search index based on the imported models. The `visual_model` is applied to the images, making the `images` table searchable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb8aef2c-484f-41a9-9956-d80b9c58eaa8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import VectorIndex, Listener\n",
    "\n",
    "# Add a VectorIndex\n",
    "db.add(\n",
    "    VectorIndex(\n",
    "        'my-index',\n",
    "        indexing_listener=Listener(\n",
    "            model=visual_model,\n",
    "            key='image',\n",
    "            select=images,\n",
    "        ),\n",
    "        compatible_listener=Listener(\n",
    "            model=text_model,\n",
    "            key='captions',\n",
    "            active=False,\n",
    "            select=None,\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d8b9e84",
   "metadata": {},
   "source": [
    "## Search Images Using Text\n",
    "\n",
    "Now, let's demonstrate how to search for images using text queries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1731a574-921a-4cba-a65c-26bff9fb9c8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import Document\n",
    "\n",
    "# Execute a query to find images with captions containing 'dog catches frisbee'\n",
    "res = db.execute(\n",
    "    images\n",
    "        .like(Document({'captions': 'dog catches frisbee'}), vector_index='my-index', n=10)\n",
    "        .limit(10)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72522031-0af8-452a-bbd1-b27dede55154",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the image data from the fourth result in the search\n",
    "res[3]['image'].x"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false
   },
   "id": "225bc82712beb398"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
